74 research outputs found

    Scalable Boolean Tensor Factorizations using Random Walks

    Full text link
    Tensors are becoming increasingly common in data mining, and consequently, tensor factorizations are becoming more and more important tools for data miners. When the data is binary, it is natural to ask if we can factorize it into binary factors while simultaneously making sure that the reconstructed tensor is still binary. Such factorizations, called Boolean tensor factorizations, can provide improved interpretability and find Boolean structure that is hard to express using normal factorizations. Unfortunately the algorithms for computing Boolean tensor factorizations do not usually scale well. In this paper we present a novel algorithm for finding Boolean CP and Tucker decompositions of large and sparse binary tensors. In our experimental evaluation we show that our algorithm can handle large tensors and accurately reconstructs the latent Boolean structure

    Reductions for Frequency-Based Data Mining Problems

    Full text link
    Studying the computational complexity of problems is one of the - if not the - fundamental questions in computer science. Yet, surprisingly little is known about the computational complexity of many central problems in data mining. In this paper we study frequency-based problems and propose a new type of reduction that allows us to compare the complexities of the maximal frequent pattern mining problems in different domains (e.g. graphs or sequences). Our results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader range of data mining problems. Our results show that, by allowing constraints in the pattern space, the complexities of many maximal frequent pattern mining problems collapse. These problems include maximal frequent subgraphs in labelled graphs, maximal frequent itemsets, and maximal frequent subsequences with no repetitions. In addition to theoretical interest, our results might yield more efficient algorithms for the studied problems.Comment: This is an extended version of a paper of the same title to appear in the Proceedings of the 17th IEEE International Conference on Data Mining (ICDM'17

    Algorithms for Approximate Subtropical Matrix Factorization

    Get PDF
    Matrix factorization methods are important tools in data mining and analysis. They can be used for many tasks, ranging from dimensionality reduction to visualization. In this paper we concentrate on the use of matrix factorizations for finding patterns from the data. Rather than using the standard algebra -- and the summation of the rank-1 components to build the approximation of the original matrix -- we use the subtropical algebra, which is an algebra over the nonnegative real values with the summation replaced by the maximum operator. Subtropical matrix factorizations allow "winner-takes-it-all" interpretations of the rank-1 components, revealing different structure than the normal (nonnegative) factorizations. We study the complexity and sparsity of the factorizations, and present a framework for finding low-rank subtropical factorizations. We present two specific algorithms, called Capricorn and Cancer, that are part of our framework. They can be used with data that has been corrupted with different types of noise, and with different error metrics, including the sum-of-absolute differences, Frobenius norm, and Jensen--Shannon divergence. Our experiments show that the algorithms perform well on data that has subtropical structure, and that they can find factorizations that are both sparse and easy to interpret.Comment: 40 pages, 9 figures. For the associated source code, see http://people.mpi-inf.mpg.de/~pmiettin/tropical

    On Defining SPARQL with Boolean Tensor Algebra

    Full text link
    The Resource Description Framework (RDF) represents information as subject-predicate-object triples. These triples are commonly interpreted as a directed labelled graph. We propose an alternative approach, interpreting the data as a 3-way Boolean tensor. We show how SPARQL queries - the standard queries for RDF - can be expressed as elementary operations in Boolean algebra, giving us a complete re-interpretation of RDF and SPARQL. We show how the Boolean tensor interpretation allows for new optimizations and analyses of the complexity of SPARQL queries. For example, estimating the size of the results for different join queries becomes much simpler

    Clustering Boolean Tensors

    Full text link
    Tensor factorizations are computationally hard problems, and in particular, are often significantly harder than their matrix counterparts. In case of Boolean tensor factorizations -- where the input tensor and all the factors are required to be binary and we use Boolean algebra -- much of that hardness comes from the possibility of overlapping components. Yet, in many applications we are perfectly happy to partition at least one of the modes. In this paper we investigate what consequences does this partitioning have on the computational complexity of the Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithms with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data with both synthetic and real-world data sets

    Latitude: A Model for Mixed Linear-Tropical Matrix Factorization

    Full text link
    Nonnegative matrix factorization (NMF) is one of the most frequently-used matrix factorization models in data analysis. A significant reason to the popularity of NMF is its interpretability and the `parts of whole' interpretation of its components. Recently, max-times, or subtropical, matrix factorization (SMF) has been introduced as an alternative model with equally interpretable `winner takes it all' interpretation. In this paper we propose a new mixed linear--tropical model, and a new algorithm, called Latitude, that combines NMF and SMF, being able to smoothly alternate between the two. In our model, the data is modeled using the latent factors and latent parameters that control whether the factors are interpreted as NMF or SMF features, or their mixtures. We present an algorithm for our novel matrix factorization. Our experiments show that our algorithm improves over both baselines, and can yield interpretable results that reveal more of the latent structure than either NMF or SMF alone.Comment: 14 pages, 6 figures. To appear in 2018 SIAM International Conference on Data Mining (SDM '18). For the source code, see https://people.mpi-inf.mpg.de/~pmiettin/linear-tropical

    Boolean matrix factorization meets consecutive ones property

    Get PDF
    Boolean matrix factorization is a natural and a popular technique for summarizing binary matrices. In this paper, we study a problem of Boolean matrix factorization where we additionally require that the factor matrices have consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layout, where nodes are ordered in circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to deal with this is to bundle edges together and represent them as ribbon. We also show that we can use OBMF for edge bundling combined with circular or linear layout techniques. We demonstrate that not only this problem is NP-hard but we cannot have a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm where at each step we look for the best 1-rank factorization. Since even obtaining 1-rank factorization is NP-hard, we propose an iterative algorithm where we fix one side and and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to cyclic ones property and symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.Peer reviewe

    What you will gain by rounding : theory and algorithms for rounding rank

    Full text link
    When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model
    • …
    corecore